{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Validation and Test Sets.ipynb",
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "id": "hMqWDc_m6rUC",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Copyright 2020 Google LLC. Double-click here for license information.\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4f3CKqFUqL2-",
        "colab_type": "text"
      },
      "source": [
        "# Validation Sets and Test Sets\n",
        "\n",
        "The previous Colab exercises evaluated the trained model against the training set, which does not provide a strong signal about the quality of your model. In this Colab, you'll experiment with validation sets and test sets.\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3spZH_kNkWWX",
        "colab_type": "text"
      },
      "source": [
        "## Learning objectives\n",
        "\n",
        "After doing this Colab, you'll know how to do the following:\n",
        "\n",
        "  * Split a [training set](https://developers.google.com/machine-learning/glossary/#training_set) into a smaller training set and a [validation set](https://developers.google.com/machine-learning/glossary/#validation_set).\n",
        "  * Analyze deltas between training set and validation set results.\n",
        "  * Test the trained model with a [test set](https://developers.google.com/machine-learning/glossary/#test_set) to determine whether your trained model is [overfitting](https://developers.google.com/machine-learning/glossary/#overfitting).\n",
        "  * Detect and fix a common training problem."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gV82DJO3kWpk",
        "colab_type": "text"
      },
      "source": [
        "## The dataset\n",
        "\n",
        "As in the previous exercise, this exercise uses the [California Housing dataset](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) to predict the `median_house_value` at the city block level.  Like many \"famous\" datasets, the California Housing Dataset actually consists of two separate datasets, each living in separate .csv files:\n",
        "\n",
        "* The training set is in `california_housing_train.csv`.\n",
        "* The test set is in `california_housing_test.csv`.\n",
        "\n",
        "You'll create the validation set by dividing the downloaded training set into two parts:\n",
        "\n",
        "* a smaller training set  \n",
        "* a validation set"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u84mXopntPFZ",
        "colab_type": "text"
      },
      "source": [
        "## Use the right version of TensorFlow\n",
        "\n",
        "The following hidden code cell ensures that the Colab will run on TensorFlow 2.X."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FBhNIdUatOU6",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Run on TensorFlow 2.x\n",
        "%tensorflow_version 2.x"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "S8gm6BpqRRuh",
        "colab_type": "text"
      },
      "source": [
        "## Import relevant modules\n",
        "\n",
        "As before, this first code cell imports the necessary modules and sets a few display options."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9D8GgUovHbG0",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Import modules\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import tensorflow as tf\n",
        "from matplotlib import pyplot as plt\n",
        "\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = \"{:.1f}\".format"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xjvrrClQeAJu",
        "colab_type": "text"
      },
      "source": [
        "## Load the datasets from the internet\n",
        "\n",
        "The following code cell loads the separate .csv files and creates the following two pandas DataFrames:\n",
        "\n",
        "* `train_df`, which contains the training set.\n",
        "* `test_df`, which contains the test set.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zUnTc_wfd_o3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "train_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")\n",
        "test_df = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_test.csv\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P_KBdj2M_yjM",
        "colab_type": "text"
      },
      "source": [
        "## Scale the label values\n",
        "\n",
        "The following code cell scales the `median_house_value`. \n",
        "See the previous Colab exercise for details."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3hc7QQhaAFXD",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "scale_factor = 1000.0\n",
        "\n",
        "# Scale the training set's label.\n",
        "train_df[\"median_house_value\"] /= scale_factor \n",
        "\n",
        "# Scale the test set's label\n",
        "test_df[\"median_house_value\"] /= scale_factor"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FhessIIV8VPc",
        "colab_type": "text"
      },
      "source": [
        "## Load the functions that build and train a model\n",
        "\n",
        "The following code cell defines two functions:\n",
        "\n",
        "  * `build_model`, which defines the model's topography.\n",
        "  * `train_model`, which will ultimately train the model, outputting not only the loss value for the training set but also the loss value for the validation set. \n",
        "\n",
        "Since you don't need to understand model building code right now, we've hidden this code cell. As always, you must run hidden code cells."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bvonhK857msj",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Define the functions that build and train a model\n",
        "def build_model(my_learning_rate):\n",
        "  \"\"\"Create and compile a simple linear regression model.\"\"\"\n",
        "  # Most simple tf.keras models are sequential.\n",
        "  model = tf.keras.models.Sequential()\n",
        "\n",
        "  # Add one linear layer to the model to yield a simple linear regressor.\n",
        "  model.add(tf.keras.layers.Dense(units=1, input_shape=(1,)))\n",
        "\n",
        "  # Compile the model topography into code that TensorFlow can efficiently\n",
        "  # execute. Configure training to minimize the model's mean squared error. \n",
        "  model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=my_learning_rate),\n",
        "                loss=\"mean_squared_error\",\n",
        "                metrics=[tf.keras.metrics.RootMeanSquaredError()])\n",
        "\n",
        "  return model               \n",
        "\n",
        "\n",
        "def train_model(model, df, feature, label, my_epochs, \n",
        "                my_batch_size=None, my_validation_split=0.1):\n",
        "  \"\"\"Feed a dataset into the model in order to train it.\"\"\"\n",
        "\n",
        "  history = model.fit(x=df[feature],\n",
        "                      y=df[label],\n",
        "                      batch_size=my_batch_size,\n",
        "                      epochs=my_epochs,\n",
        "                      validation_split=my_validation_split)\n",
        "\n",
        "  # Gather the model's trained weight and bias.\n",
        "  trained_weight = model.get_weights()[0]\n",
        "  trained_bias = model.get_weights()[1]\n",
        "\n",
        "  # The list of epochs is stored separately from the \n",
        "  # rest of history.\n",
        "  epochs = history.epoch\n",
        "  \n",
        "  # Isolate the root mean squared error for each epoch.\n",
        "  hist = pd.DataFrame(history.history)\n",
        "  rmse = hist[\"root_mean_squared_error\"]\n",
        "\n",
        "  return epochs, rmse, history.history   \n",
        "\n",
        "print(\"Defined the build_model and train_model functions.\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8gRu4Ri0D8tH",
        "colab_type": "text"
      },
      "source": [
        "## Define plotting functions\n",
        "\n",
        "The `plot_the_loss_curve` function plots loss vs. epochs for both the training set and the validation set."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QA7hsqPZDvVM",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Define the plotting function\n",
        "\n",
        "def plot_the_loss_curve(epochs, mae_training, mae_validation):\n",
        "  \"\"\"Plot a curve of loss vs. epoch.\"\"\"\n",
        "\n",
        "  plt.figure()\n",
        "  plt.xlabel(\"Epoch\")\n",
        "  plt.ylabel(\"Root Mean Squared Error\")\n",
        "\n",
        "  plt.plot(epochs[1:], mae_training[1:], label=\"Training Loss\")\n",
        "  plt.plot(epochs[1:], mae_validation[1:], label=\"Validation Loss\")\n",
        "  plt.legend()\n",
        "  \n",
        "  # We're not going to plot the first epoch, since the loss on the first epoch\n",
        "  # is often substantially greater than the loss for other epochs.\n",
        "  merged_mae_lists = mae_training[1:] + mae_validation[1:]\n",
        "  highest_loss = max(merged_mae_lists)\n",
        "  lowest_loss = min(merged_mae_lists)\n",
        "  delta = highest_loss - lowest_loss\n",
        "  print(delta)\n",
        "\n",
        "  top_of_y_axis = highest_loss + (delta * 0.05)\n",
        "  bottom_of_y_axis = lowest_loss - (delta * 0.05)\n",
        "   \n",
        "  plt.ylim([bottom_of_y_axis, top_of_y_axis])\n",
        "  plt.show()  \n",
        "\n",
        "print(\"Defined the plot_the_loss_curve function.\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jipBqEQXlsN8",
        "colab_type": "text"
      },
      "source": [
        "## Task 1: Experiment with the validation split\n",
        "\n",
        "In the following code cell, you'll see a variable named `validation_split`, which we've initialized at 0.2.  The `validation_split` variable specifies the proportion of the original training set that will serve as the validation set. The original training set contains 17,000 examples. Therefore, a `validation_split` of 0.2 means that:\n",
        "\n",
        "* 17,000 * 0.2 ~= 3,400 examples will become the validation set.\n",
        "* 17,000 * 0.8 ~= 13,600 examples will become the new training set.\n",
        "\n",
        "The following code builds a model, trains it on the training set, and evaluates the built model on both:\n",
        "\n",
        "* The training set.\n",
        "* And the validation set.\n",
        "\n",
        "If the data in the training set is similar to the data in the validation set, then the two loss curves and the final loss values should be almost identical. However, the loss curves and final loss values are **not** almost identical. Hmm, that's odd.  \n",
        "\n",
        "Experiment with two or three different values of `validation_split`.  Do different values of `validation_split` fix the problem? \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "knP23Taoa00a",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# The following variables are the hyperparameters.\n",
        "learning_rate = 0.08\n",
        "epochs = 30\n",
        "batch_size = 100\n",
        "\n",
        "# Split the original training set into a reduced training set and a\n",
        "# validation set. \n",
        "validation_split=0.2\n",
        "\n",
        "# Identify the feature and the label.\n",
        "my_feature=\"median_income\"  # the median income on a specific city block.\n",
        "my_label=\"median_house_value\" # the median value of a house on a specific city block.\n",
        "# That is, you're going to create a model that predicts house value based \n",
        "# solely on the neighborhood's median income.  \n",
        "\n",
        "# Discard any pre-existing version of the model.\n",
        "my_model = None\n",
        "\n",
        "# Invoke the functions to build and train the model.\n",
        "my_model = build_model(learning_rate)\n",
        "epochs, rmse, history = train_model(my_model, train_df, my_feature, \n",
        "                                    my_label, epochs, batch_size, \n",
        "                                    validation_split)\n",
        "\n",
        "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"], \n",
        "                    history[\"val_root_mean_squared_error\"])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TKa11JK4Pm3f",
        "colab_type": "text"
      },
      "source": [
        "## Task 2: Determine **why** the loss curves differ\n",
        "\n",
        "No matter how you split the training set and the validation set, the loss curves differ significantly. Evidently, the data in the training set isn't similar enough to the data in the validation set. Counterintuitive? Yes, but this problem is actually pretty common in machine learning. \n",
        "\n",
        "Your task is to determine **why** the loss curves aren't highly similar. As with most issues in machine learning, the problem is rooted in the data itself. To solve this mystery of why the training set and validation set aren't almost identical, write a line or two of [pandas code](https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb?utm_source=validation-colab&utm_medium=colab&utm_campaign=colab-external&utm_content=pandas_tf2-colab&hl=en) in the following code cell.  Here are a couple of hints:\n",
        "\n",
        "  * The previous code cell split the original training set into:\n",
        "    * a reduced training set (the original training set - the validation set)\n",
        "    * the validation set \n",
        "  * By default, the pandas [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method outputs the *first* 5 rows of the DataFrame. To see more of the training set, specify the `n` argument to `head` and assign a large positive integer to `n`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VJQcAZkwJt_p",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Write some code in this code cell."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EnNvkFwwK8WY",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Double-click for a possible solution to Task 2.\n",
        "\n",
        "# Examine examples 0 through 4 and examples 25 through 29\n",
        "# of the training set\n",
        "train_df.head(n=1000)\n",
        "\n",
        "# The original training set is sorted by longitude. \n",
        "# Apparently, longitude influences the relationship of\n",
        "# total_rooms to median_house_value."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rw4xI1ZEckI8",
        "colab_type": "text"
      },
      "source": [
        "## Task 3. Fix the problem\n",
        "\n",
        "To fix the problem, shuffle the examples in the training set before splitting the examples into a training set and validation set. To do so, take the following steps:\n",
        "\n",
        "1. Shuffle the data in the training set by adding the following line anywhere before you call `train_model` (in the code cell associated with Task 1):\n",
        "\n",
        "```\n",
        "  shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index))\n",
        "```                                    \n",
        "\n",
        "2. Pass `shuffled_train_df` (instead of `train_df`) as the second argument to `train_model` (in the code call associated with Task 1) so that the call becomes as follows:\n",
        "\n",
        "```\n",
        "  epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature, \n",
        "                                      my_label, epochs, batch_size, \n",
        "                                      validation_split)\n",
        "```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ncODhpv0h-LG",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Double-click to view the complete implementation.\n",
        "\n",
        "# The following variables are the hyperparameters.\n",
        "learning_rate = 0.08\n",
        "epochs = 70\n",
        "batch_size = 100\n",
        "\n",
        "# Split the original training set into a reduced training set and a\n",
        "# validation set. \n",
        "validation_split=0.2\n",
        "\n",
        "# Identify the feature and the label.\n",
        "my_feature=\"median_income\"  # the median income on a specific city block.\n",
        "my_label=\"median_house_value\" # the median value of a house on a specific city block.\n",
        "# That is, you're going to create a model that predicts house value based \n",
        "# solely on the neighborhood's median income.  \n",
        "\n",
        "# Discard any pre-existing version of the model.\n",
        "my_model = None\n",
        "\n",
        "# Shuffle the examples.\n",
        "shuffled_train_df = train_df.reindex(np.random.permutation(train_df.index)) \n",
        "\n",
        "# Invoke the functions to build and train the model. Train on the shuffled\n",
        "# training set.\n",
        "my_model = build_model(learning_rate)\n",
        "epochs, rmse, history = train_model(my_model, shuffled_train_df, my_feature, \n",
        "                                    my_label, epochs, batch_size, \n",
        "                                    validation_split)\n",
        "\n",
        "plot_the_loss_curve(epochs, history[\"root_mean_squared_error\"], \n",
        "                    history[\"val_root_mean_squared_error\"])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tKN239_miW8C",
        "colab_type": "text"
      },
      "source": [
        "Experiment with `validation_split` to answer the following questions:\n",
        "\n",
        "* With the training set shuffled, is the final loss for the training set closer to the final loss for the validation set?  \n",
        "* At what range of values of `validation_split` do the final loss values for the training set and validation set diverge meaningfully?  Why?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-UAJ3Q86iz31",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Double-click for the answers to the questions\n",
        "\n",
        "# Yes, after shuffling the original training set, \n",
        "# the final loss for the training set and the \n",
        "# validation set become much closer.\n",
        "\n",
        "# If validation_split < 0.15,\n",
        "# the final loss values for the training set and\n",
        "# validation set diverge meaningfully.  Apparently,\n",
        "# the validation set no longer contains enough examples. "
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1PP-O8TOZOeo",
        "colab_type": "text"
      },
      "source": [
        "## Task 4: Use the Test Dataset to Evaluate Your Model's Performance\n",
        "\n",
        "The test set usually acts as the ultimate judge of a model's quality. The test set can serve as an impartial judge because its examples haven't been used in training the model. Run the following code cell to evaluate the model with the test set:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nd_Sw2cygOip",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "x_test = test_df[my_feature]\n",
        "y_test = test_df[my_label]\n",
        "\n",
        "results = my_model.evaluate(x_test, y_test, batch_size=batch_size)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qoyQKvsjmV_A",
        "colab_type": "text"
      },
      "source": [
        "Compare the root mean squared error of the model when evaluated on each of the three datasets:\n",
        "\n",
        "* training set: look for `root_mean_squared_error` in the final training epoch.\n",
        "* validation set: look for `val_root_mean_squared_error` in the final training epoch.\n",
        "* test set: run the preceding code cell and examine the `root_mean_squred_error`.\n",
        "\n",
        "Ideally, the root mean squared error of all three sets should be similar. Are they?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FxXtp-aVdIgJ",
        "colab_type": "code",
        "cellView": "form",
        "colab": {}
      },
      "source": [
        "#@title Double-click for an answer\n",
        "\n",
        "# In our experiments, yes, the rmse values \n",
        "# were similar enough. "
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
